Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable predicate pushdown for categorical dimension filters #1227

Merged

Conversation

tlento
Copy link
Contributor

@tlento tlento commented May 22, 2024

Enable predicate pushdown for categorical dimension filters

We now have the ability to push down filter predicates within
the DataflowPlan. We start with categorical dimension filters,
as they are the simplest.

This change simply tracks the where filters applied at the measure
node and pushes all of them down to the construction of the source
node for evaluation. At this time a filter is eligible to be applied
to the source node if it only contains references to categorical dimensions
that originate from the same, singular semantic model definition that
feeds into the source node in question.

We do not support time dimensions at this time, as they can cause strange
interactions with things like cumulative metrics, which could result in
inappropriate input filtering that produces non-obviously censored metric
results.

We also do not support entities at this time, as entities may be defined
in multiple semantic models and as such filters must be applied with more
care to ensure we are correctly accounting for the entity link paths to
the relevant source node, if any, when we apply the filter.

Finally, we are not able to safely push predicates down for the "null value"
side of an outer join, which, in practice, restricts us to only doing predicate
pushdown to the measure source nodes.

The snapshot test changes for existing snapshots highlight the new behavior,
while the added test snapshots demonstrate specific circumstances of
interest.

Copy link
Contributor Author

tlento commented May 22, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @tlento and the rest of your teammates on Graphite Graphite

@tlento tlento force-pushed the add-element-type-to-linkable-element-interface branch from 9040f9f to ee88ff9 Compare May 23, 2024 23:41
@tlento tlento force-pushed the enable-predicate-pushdown-for-categorical-dimension-filters branch from 8b41889 to fba565c Compare May 23, 2024 23:42
@tlento tlento changed the base branch from add-element-type-to-linkable-element-interface to add-semantic-model-origin-to-linkable-element-interface May 23, 2024 23:42
@tlento tlento added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label May 24, 2024
@tlento tlento had a problem deploying to DW_INTEGRATION_TESTS May 24, 2024 00:11 — with GitHub Actions Failure
@tlento tlento had a problem deploying to DW_INTEGRATION_TESTS May 24, 2024 00:11 — with GitHub Actions Failure
@tlento tlento had a problem deploying to DW_INTEGRATION_TESTS May 24, 2024 00:11 — with GitHub Actions Failure
@tlento tlento had a problem deploying to DW_INTEGRATION_TESTS May 24, 2024 00:11 — with GitHub Actions Failure
Copy link
Contributor Author

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick note - the intervening commits show where I got a little too aggressive with pushdown, and the snapshots show one of the issues with the "during construction" approach - it's exceedingly difficult to properly push the filter down without duplicating it in the query, because the recursive tracking of which filters got pushed and which ones didn't is super complicated. Some of these pushdown operations are effectively undoing the subquery reducing optimizer, but have no other practical effect.

Future updates will:

  1. Add some dedicated join + filter test cases to ensure things keep returning the same results in the face of predicate pushdown. We incidentally seem to be covering the currently operating edge cases, but having explicit tests for all of the boundary conditions will be helpful as we restructure.
  2. Finish consolidating our pushdown logic - including time window adjustments - into the pushdown state object
  3. Convert our process to an optimizer pass and remove most predicate management from the DataflowPlanBuilder
  4. Fix up the pushdown rendering so we don't double-render filters without justification

Whether those come before or after basic support for time dimensions remains to be seen, but I think it likely we'll do them first.

Comment on lines 34 to 36
) subq_21
WHERE booking__is_instant
) subq_23
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty dumb. If the filter is defined on a measure we typically apply it immediately anyway. The only way to not do this is to be more clever about storing recursive state and ensuring that predicates only get pushed past joins.

For now I think this will do, but it's something we should probably resolve when we add support for time filters, as simple queries with time filters will often do an unnecessary pushdown.

@tlento tlento marked this pull request as ready for review May 24, 2024 00:31
@tlento tlento added Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment and removed Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment labels May 24, 2024
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS May 24, 2024 00:35 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS May 24, 2024 00:35 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS May 24, 2024 00:35 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS May 24, 2024 00:35 — with GitHub Actions Inactive
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label May 24, 2024
Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the snapshots and tests look good.
After reading this, I'm still confused as to how we ensure only categorical dimensions get pushed down 🤔 So trying to dig in to understand that before I approve!

@@ -680,7 +691,9 @@ def _build_plan_for_distinct_values(self, query_spec: MetricFlowQuerySpec) -> Da
required_linkable_specs, _ = self.__get_required_and_extraneous_linkable_specs(
queried_linkable_specs=query_spec.linkable_specs, filter_specs=query_level_filter_specs
)
predicate_pushdown_state = PredicatePushdownState(time_range_constraint=query_spec.time_range_constraint)
predicate_pushdown_state = PredicatePushdownState(
time_range_constraint=query_spec.time_range_constraint, where_filter_specs=query_level_filter_specs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this get narrowed down to only categorical dimensions? 🤔

@@ -22,17 +22,34 @@ FROM (
-- Join Standard Outputs
-- Pass Only Elements: ['average_booking_value', 'listing__is_lux_latest', 'booking__is_instant']
SELECT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but I've been wondering if we can optimize away these interim subqueries that do nothing but select columns 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're supposed to get optimized away, and I'm not entirely sure why they don't. There may be some column or subquery alias thing that's getting in the way.

Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looooks goood!!!
Left a couple of inline nits, nothing blocking!

invalid_element_types = [
element for element in spec.linkable_elements if element.element_type not in enabled_element_types
]
if len(semantic_models) == 1 and len(invalid_element_types) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so we only push down filters if ALL the filtered elements are eligible element types. Is that because we don't know if this is an AND or an OR filter at this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, we cannot push down a filter with any invalid element types. The AND vs OR nature of things isn't relevant, it's because we don't have a way to handle those element types. Right now that's just because we haven't implemented handling, but in future it could be due to a given query being too difficult to manage for a given element type.

For example, agg time dimension filters against a mixture of cumulative and derived offset metric inputs could get very tricky. In those cases we may not be able to push down a where filter with a time dimension.

My expectation is that this will be more refined than clobbering anything that has a time dimension of any kind in it, but for now this definitely works and we can use more finesse later.

eligible_filter_specs_by_model: Dict[SemanticModelReference, Sequence[WhereFilterSpec]] = {}
for spec in where_filter_specs:
semantic_models = set(element.semantic_model_origin for element in spec.linkable_elements)
invalid_element_types = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this is the logic I was looking for 👍

]
if len(matching_filter_specs) == 0:
filtered_nodes.append(source_node)
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this continue actually do anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I think originally I didn't have an else here. I've removed the redundant continue statement to keep the structure of this consistent with the time range constraint method.

matching_filter_specs = [
filter_spec
for filter_spec in eligible_filter_specs
if all([spec in source_node_specs.linkable_specs for spec in filter_spec.linkable_specs])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there ever a time where this won't be true, since we get the semantic model from the linkable element above? Or is this just an extra safety check in case something gets misconfigured?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this time it's a safeguard against something weird happening where a given source node isn't configured correctly. However, I expect this filter to be relevant for entities, since they may be defined in multiple semantic models and we need to be able to explicitly allow or disallow pushdown in those cases. If we ever add a pre-joined source node, for example, we might encounter a scenario where the entity and dimension come from different semantic models and then we couldn't push down past this point (and maybe shouldn't push down to this point, either).

@tlento tlento force-pushed the add-semantic-model-origin-to-linkable-element-interface branch from 7672db8 to 506ca50 Compare May 29, 2024 23:24
@tlento tlento force-pushed the enable-predicate-pushdown-for-categorical-dimension-filters branch from 2e250d1 to aea5c0b Compare May 29, 2024 23:25
Copy link
Contributor Author

tlento commented Jun 25, 2024

Merge activity

  • Jun 24, 8:29 PM PDT: @tlento started a stack merge that includes this pull request via Graphite.
  • Jun 24, 8:31 PM PDT: Graphite rebased this pull request as part of a merge.
  • Jun 24, 8:34 PM PDT: @tlento merged this pull request with Graphite.

Base automatically changed from add-semantic-model-origin-to-linkable-element-interface to main June 25, 2024 03:30
tlento added 7 commits June 25, 2024 03:30
We now have the ability to push down filter predicates within
the DataflowPlan. We start with categorical dimension filters,
as they are the simplest.

This change simply tracks the where filters applied at the measure
node and pushes all of them down to the construction of the source
node for evaluation. At this time a filter is eligible to be applied
to the source node if it only contains references to categorical dimensions
that originate from the same, singular semantic model definition that
feeds into the source node in question.

We do not support time dimensions at this time, as they can cause strange
interactions with things like cumulative metrics, which could result in
inappropriate input filtering that produces non-obviously censored metric
results.

We also do not support entities at this time, as entities may be defined
in multiple semantic models and as such filters must be applied with more
care to ensure we are correctly accounting for the entity link paths to
the relevant source node, if any, when we apply the filter.

The snapshot test changes for existing snapshots highlight the new behavior,
while the added test snapshots demonstrate specific circumstances of
interest.
Note for reviewers, as this will be squashed:

Pushdown on joined in dimensions was not working as expected. In
the original predicate pushdown rendering tests, most joined in
dimensions were being skipped for pushdown operations.

The root cause of the issue was the semantic model source value
we were accessing, which actually included the complete history
of all semantic model inputs for the joined in dimension.

Fixing that problem uncovered a separate issue, which is we were
inappropriately pushing filters down past outer join expressions.

This commit fixes both issues at once - we now only push down
on the "safe" side of an outer join (the left side for LEFT OUTER
and not at all for FULL OUTER joins), and we evaluate pushdown
based on the singuolar semantic model source where each element is
defined.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants